Lait Equitable: A Data-Driven Approach to Fair Trade milk
Author
Jayesh Smith and Emeline Raimon-Dacunha-Castelle Urs Hurni
Published
May 28, 2024
Abstract
The following project focuses on the analysis of the Lait Equitable dataset, which contains information on the production of fair trade milk in Switzerland. The goal of this project is to analyze the dataset and identify trends and patterns in the data that can help us better understand the production of fair trade milk in Switzerland. We will use a variety of data analysis techniques, including exploratory data analysis, data visualization, and statistical modeling, to analyze the dataset and draw conclusions about the production of fair trade milk in Switzerland.
1 Introduction
1.1 Company Background
Lait Equitable, also known as Fair Milk, is a transformative Swiss initiative designed to ensure fair compensation for dairy producers across Switzerland. Founded in response to a prolonged crisis in the dairy industry, Lait Equitable aims to address the substantial disparity between the cost of milk production and the compensation received by producers. Historically, the plummeting prices of milk have often failed to cover production costs, leading to financial stress among producers. The cooperative structure of Lait Equitable ensures that each liter of milk is sold for 1 CHF, aligning closely with the actual production cost of approximately 98 cents per liter as estimated by Agridea in 2016.
Key Points:
Foundation: Lait Equitable was established to counteract the severe financial difficulties faced by Swiss milk producers due to inadequate compensation for their products. This initiative emerged as a critical response to the drastic reduction in the number of operational dairy farms, which saw a decline from about 44,000 producers 25 years ago to 17,164 by the end of 2023.
Operation: The initiative operates through the “Lait Equitable” cooperative. This entity is responsible for collecting, processing, and distributing milk, ensuring that producers receive fair compensation. It is noteworthy that while the cooperative guarantees 1 CHF per liter, the actual market price often falls significantly below this, making the cooperative’s top-up essential for fair compensation.
Product Range: Under the brand Faireswiss, Lait Equitable offers a range of dairy products available at various retail outlets. While the milk in Faireswiss products may not always directly come from the cooperative members due to logistical reasons, the fair compensation model remains intact.
Membership: The cooperative is inclusive, welcoming all Swiss milk producers, including those involved in specialty products like Gruyère cheese. While the cooperative initially included cheese producers, decisions made in recent general assemblies have temporarily restricted this inclusion, subject to future reassessment.
This initiative not only supports the livelihood of local farmers but also forms part of a broader European movement towards fair milk pricing, coordinated by the European Milk Board (EMB). Lait Equitable serves as a beacon of hope and a model for sustainable agricultural practices in the dairy industry, ensuring producers can live off their work dignifiedly.
Price Differences: To analyze the price differences between conventional and organic milk and their impact on market demand and supply.
Demand for Organic Milk: To study how the demand for organic milk has evolved in recent years and identify the driving factors behind this trend.
Sales Analysis Across Manors: To determine the factors contributing to the success of Lait Equitable’s milk in some Manor stores but not in others.
2 Data
2.1 Market for dairy products in Switzerland
The dairy market is an important sector of Swiss agriculture and the agri-food industry. The dairy industry accounts for around 20% of agricultural production.
Our data source is the Swiss Federal Office for Agriculture, OFAG. The dataset shows price trends on the dairy market at production level. The farmgate milk price is calculated on the basis of a monthly survey of the main milk buyers. These are the companies that buy milk directly from producers. International milk price trends are also shown for comparison purposes, we will look at this later.
We start by removing the columns with no value or unique value, and the variables that are not relevant for our analysis. After proceeding, we end up with a sub-data set of 6 variables.
Click to show code
# Importer les donnéesdata <-read.csv('../data/Données_marché_Lait.csv')# Faire une copie des donnéesclean_data <- data# Supprimer les colonnes avec des valeurs uniquescolumns_to_drop <-c('Devise', 'Composants.de.coûts', 'Devise', 'Type.de.données', 'Source.de.données', 'Groupe.de.produits', 'Commerce.extérieur', 'Indicateur', 'Marché', 'Propriétés.du.produit', "Mode.d.utilisation", 'Unité', 'Produit', 'Echelon.de.création.de.valeur', 'Echelon.de.création.de.valeur_Détail')clean_data <- clean_data %>%select(-all_of(columns_to_drop))# Convertir la colonne 'Date' en format Date (mois-année)clean_data$Date <-parse_date_time(clean_data$Date, orders ="my")# Renommer les colonnesclean_data <- clean_data %>%rename(`System of production`=`Système.de.production`,`Product origin`=`Provenance.du.produit`,`Product subgroup`=`Sous.groupe.de.produits`,`Sales region`=`Région.de.vente`,`Price`=`Prix` )# Changer les valeurs catégorielles dans les colonnes spécifiéesclean_data <- clean_data %>%mutate(`System of production`=recode(`System of production`, 'Conventionnel'='Conventional'),`Product origin`=recode(`Product origin`,'Région 1'='Region 1','Région 2'='Region 2','Région 3'='Region 3','Région 4'='Region 4','Région 5'='Region 5','Suisse'='Switzerland','Reste du monde, non suisse'='Rest of the world, non-Swiss','Nouvelle-Zélande'='New Zealand','Italie'='Italy','UE'='EU','Allemagne'='Germany','Autriche'='Austria'),`Product subgroup`=recode(`Product subgroup`,'Lait cru CH'='CH Milk ','Lait cru, International'='International Milk'), # Ajustez selon vos données`Sales region`=recode(`Sales region`,'Suisse'='Switzerland','Nouvelle-Zélande'='New Zealand','Italie'='Italy','UE-28'='EU-28','Allemagne'='Germany','Autriche'='Austria')) # Ajustez selon vos données# Afficher les données en utilisant reactablereactable(head(clean_data,1000), highlight =TRUE, # Highlight rows on hoverdefaultPageSize =10, # Display 10 rows per pagepaginationType ="numbers", # Use numbers for page navigationsearchable =TRUE, # Make the table searchablesortable =TRUE, # Allow sortingresizable =TRUE# Allow column resizing)
2.1.1.1 Description
The data covers the period from January 2001 to January 2024. It provides information of the type of system of production : conventional, bio, or IP Swiss production of milk. The origin and the sales region of the milk include Switzerland and its regions, as well as other international countries. This will enable us to carry out comparative analyses of Swiss and international milk prices. The ‘Price’ column represents the weighted average prices per quantity, mainly at the farm.
Click to show code
# Créez une table tibble avec des descriptions des variablesvariable_table <-tibble(Variable =c("Date", "System of production", "Product origin", "Product subgroup", "Sales region", "Price"),Description =c("The date when the data was recorded, in a year-month-day format.","The system of production for the product.","The origin of the product.","The subgroup to which the product belongs.","The region where the product is sold.","The price in centimes of the product." ))# Affichez la table avec kableExtravariable_table %>%kbl() %>%kable_styling(position ="center", bootstrap_options =c("striped", "bordered", "hover", "condensed"))
Variable
Description
Date
The date when the data was recorded, in a year-month-day format.
System of production
The system of production for the product.
Product origin
The origin of the product.
Product subgroup
The subgroup to which the product belongs.
Sales region
The region where the product is sold.
Price
The price in centimes of the product.
2.1.1.2 Missing Values
Then, we calculate the percentage of missing values for each characteristics and filter out those without.
Click to show code
# Calculer le pourcentage de NaN pour chaque colonnenan_percent <- clean_data %>%summarise(across(everything(), ~mean(is.na(.)) *100)) %>%pivot_longer(everything(), names_to ="Colonne", values_to ="Pourcentage_de_NaN")# Filtrer pour garder seulement les colonnes avec NaNnan_percent <- nan_percent %>%filter(Pourcentage_de_NaN >0)
The columns with missing values on our sub data set are: Sales region and System of production. The ‘System of production’ has a higher percentage of missing values, around 37.6%. However, this is a key variable in our analysis so we keep it and replace missing information ‘NaN’ with ‘Unknown’ for clarity. As for Sales region, the percentage of missing values is relatively low thus we keep it.
2.1.2 Swiss dairy products market
2.1.2.1 Wrangling & Cleaning
The main objective of this project is to focus on milk producers in Switzerland and more specifically on their remuneration. To do this, we are going to create a sub data set swiss_production_data that will only keep the Swiss regions in ‘Product origin’.
Click to show code
# Sélectionner les colonnes spécifiéesselected_columns <-c('Date', 'System of production', 'Product origin', 'Price')subset_data <- clean_data %>%select(all_of(selected_columns))# Définir les régions à conserverregions_to_keep <-c('Region 1', 'Region 2', 'Region 3', 'Region 4', 'Region 5')swiss_production_data <- subset_data %>%filter(`Product origin`%in% regions_to_keep)# Remplacer les valeurs manquantes par 'Unknown'swiss_production_data <- swiss_production_data %>%mutate(across(everything(), ~replace_na(., 'Unknown')))# Afficher les données en utilisant reactablereactable(head(swiss_production_data,1000),highlight =TRUE, # Highlight rows on hoverdefaultPageSize =10, # Display 10 rows per pagepaginationType ="numbers", # Use numbers for page navigationsearchable =TRUE, # Make the table searchablesortable =TRUE, # Allow sortingresizable =TRUE# Allow column resizing)
2.1.2.2 Description
Here is the geographical distribution of the 5 regions in Switzerland.
Click to show code
# Créez une table tibble avec des descriptions des variablestable <-tibble(Variable =c("Region 1", "Region 2", "Region 3", "Region 4", "Region 5"),Description =c("Geneva, Vaud, Fribourg, Neuchâtel, Jura and the French-speaking parts of the canton of Bern (administrative district of Bernese Jura).", "Bern, except the administrative district of Jura Bernois, Lucerne, Unterwalden : Obwalden. Nidwalden, Uri, Zug and part of the canton of Schwyz (district of Schwyz, Gersau and Küssnacht).","Basel-Landschaft and Basel-Stadt, Aargau and Solothurn","Zurich, Schaffhausen, Thurgau. Appenzell (Inner and Outer Rhodes), St. Gallen, part of Canton Schwyz (districts of Einsiedeln, March and Höfe), Glarus, Graubünden","Valais and Ticino" ))# Affichez la table avec kableExtratable %>%kbl() %>%kable_styling(position ="center", bootstrap_options =c("striped", "bordered", "hover", "condensed"))
Variable
Description
Region 1
Geneva, Vaud, Fribourg, Neuchâtel, Jura and the French-speaking parts of the canton of Bern (administrative district of Bernese Jura).
Region 2
Bern, except the administrative district of Jura Bernois, Lucerne, Unterwalden : Obwalden. Nidwalden, Uri, Zug and part of the canton of Schwyz (district of Schwyz, Gersau and Küssnacht).
Region 3
Basel-Landschaft and Basel-Stadt, Aargau and Solothurn
Region 4
Zurich, Schaffhausen, Thurgau. Appenzell (Inner and Outer Rhodes), St. Gallen, part of Canton Schwyz (districts of Einsiedeln, March and Höfe), Glarus, Graubünden
Region 5
Valais and Ticino
Click to show code
file_path <-"../data/"df_producteur <-read_excel(paste0(file_path, "lait_cru_producteur.xlsx"), sheet =1)df_producteur$date <-as.Date(df_producteur$date)library(kableExtra)# Create a tibble with variable descriptions for df_producteurvariable_table <-tibble(Variable =c("Date", "prix_bio", "prix_non_bio", "delta", "Delta_pourcent"),Description =c("The date when the prices were recorded, in a year-month-day format.","The recorded price of organic milk on the given date.","The recorded price of non-organic milk on the given date.","The absolute difference between the organic and non-organic milk prices.","The percentage difference between the organic and non-organic milk prices." ))# Display the table using kableExtravariable_table %>%kbl() %>%kable_styling(position ="center", bootstrap_options =c("striped", "bordered", "hover", "condensed"))
Variable
Description
Date
The date when the prices were recorded, in a year-month-day format.
prix_bio
The recorded price of organic milk on the given date.
prix_non_bio
The recorded price of non-organic milk on the given date.
delta
The absolute difference between the organic and non-organic milk prices.
Delta_pourcent
The percentage difference between the organic and non-organic milk prices.
2.1.3 Description
This data set explains the Milk price to producers over time. Essential for Macro Analysis of the milk industry in Switzerland.
Have a look :
Click to show code
# create a new data cleaneddf_producteur_show <- df_producteur %>%mutate(delta = prix_bio - prix_non_bio,delta_pourcent = (prix_bio - prix_non_bio) / prix_non_bio *100) %>%select(date, prix_bio, prix_non_bio, delta, delta_pourcent) %>%#round all column to 2 decimal placesmutate_if(is.numeric, round, 2) #print max and min values for delta_pourcent# max_delta_pourcent <- max(df_producteur_show$delta_pourcent, na.rm = TRUE)# max_delta_pourcent# min_delta_pourcent <- min(df_producteur_show$delta_pourcent, na.rm = TRUE)# min_delta_pourcent#display cleaned data using reactablelibrary(reactable)reactable( df_producteur_show, highlight =TRUE, # Highlight rows on hoverdefaultPageSize =10, # Display 10 rows per pagepaginationType ="numbers", # Use numbers for page navigationsearchable =TRUE, # Make the table searchablesortable =TRUE, # Allow sortingresizable =TRUE# Allow column resizing)
The dataset from 2023 was meticulously cleaned and standardized to ensure accuracy in our analysis. Initial steps included loading the data from an Excel file and renaming columns to reflect clearer, month-specific sales data for easier readability. Additionally, we corrected city names to maintain consistency across datasets. This included mapping various forms of city names to their standardized counterparts (e.g., ‘Bâle’ to ‘Basel’).
2.2.1.2 Description
The dataset is very light and contains monthly sales data for the year 2023. It is essential however for the analysis of the sales of Lait Equitable in different Manor stores.
Here is a preview of the cleaned and structured data:
Click to show code
#load python dfdf_sales_2023 <- py$df# Load necessary librarieslibrary(tibble)library(kableExtra)# Create a tibble with variable descriptions for df_manor_salesvariable_table <-tibble(Variable =c("Row Labels", "Monthly Columns (2023-01-01 to 2023-12-01)", "Grand Total"),Description =c("Identifies the Manor store location by name.","Each column represents sales figures for a specific month of 2023","Total sales across all months of 2023 for each location" ))# Display the table using kableExtravariable_table %>%kbl() %>%kable_styling(position ="center", bootstrap_options =c("striped", "bordered", "hover", "condensed"))
Variable
Description
Row Labels
Identifies the Manor store location by name.
Monthly Columns (2023-01-01 to 2023-12-01)
Each column represents sales figures for a specific month of 2023
Grand Total
Total sales across all months of 2023 for each location
Click to show code
# Using the provided column names correctly in the dataframe df_sales_2023df_sales_2023_show <- df_sales_2023 %>%# Ensure you convert the column names to standard ones if neededrename(Location =`Row Labels`) %>%# Correctly sum the monthly sales columns from Jan 2023 to Dec 2023mutate(Total_Sales =rowSums(select(., `Jan 2023`:`Dec 2023`), na.rm =TRUE)) %>%select(Location, `Jan 2023`:`Dec 2023`, Total_Sales) %>%mutate_if(is.numeric, round, 2) # round all numeric columns to 2 decimal places# Display the data using reactable for an interactive tablereactable( df_sales_2023_show, highlight =TRUE, # Highlight rows on hoverdefaultPageSize =10, # Display 10 rows per pagepaginationType ="numbers", # Use numbers for page navigationsearchable =TRUE, # Make the table searchablesortable =TRUE, # Allow sortingresizable =TRUE# Allow column resizing)
2.2.2 Dataset Sales 2022
2.2.2.1 Wrangling and Cleaning
Following the methodology established with the 2023 dataset, the 2022 sales data was similarly processed. The data from 2022, while structurally different, was also standardized to facilitate comparison and analysis. This included renaming columns to ensure uniformity in location names across both datasets.
Click to show code
# Load the data for 2022file_path_2022 ='../data/sales_2022.xlsx'df_2022 = pd.read_excel(file_path_2022)# Standardize city names based on the provided mappingcity_name_mapping = {'Bâle': 'Basel','Genève': 'Geneva','Bienne': 'Biel/Bienne','Chavannes': 'Chavannes-de-Bogis','Marin': 'Marin-Epagnier','Vesenaz': 'Vésenaz','Yverdon': 'Yverdon-les-Bains','Saint-Gall Webersbleiche': 'St. Gall'}# Rename columns to standardize city namesdf_2022.rename(columns=city_name_mapping, inplace=True)# Pivoting the table to get total sales per location for 2022, summing across all productssales_columns_2022 = [col for col in df_2022.columns if col notin ['Code article', 'Description article', 'Marque', 'Code Fournisseur', 'Description Fournisseur']]df_2022_total_sales = df_2022[sales_columns_2022].sum().reset_index()df_2022_total_sales.columns = ['Location', 'Total Sales 2022']
2.2.2.2 Description
The 2022 dataset, unlike the 2023 dataset, includes a variety of products, each recorded with sales figures across different locations. This dataset is notably less complex, focusing on total sales rather than monthly breakdowns, yet provides critical insights into the sales performance of different products.
Click to show code
# Load the 2022 sales datadf_sales_2022 <- py$df_2022# Load necessary librarieslibrary(tibble)library(kableExtra)# Create a tibble with variable descriptions for df_salesvariable_table <-tibble(Variable =c("Code article", "Description article", "Marque", "Code Fournisseur", "Description Fournisseur","Location Columns (e.g., Ascona-Delta, Baden, Bâle, etc.)"),Description =c("Unique identifier for each product.","Descriptive name of the product.","Brand of the product.","Supplier code.","Supplier name.","Each of these columns represents sales figures for that specific location." ))# Display the table using kableExtravariable_table %>%kbl() %>%kable_styling(position ="center", bootstrap_options =c("striped", "bordered", "hover", "condensed"))
# Extracting the total sales for 2023 from the first datasetdf_2023_total_sales = df[['Row Labels', 'Grand Total']].rename(columns={'Row Labels': 'Location', 'Grand Total': 'Total Sales 2023'})# Merging the 2022 and 2023 datasets on Locationmerged_sales_data = pd.merge(df_2022_total_sales, df_2023_total_sales, on='Location', how='outer')# Filling any NaN values that might have occurred due to locations present in one dataset and not the othermerged_sales_data.fillna(0, inplace=True)
Click to show code
# Load the merged sales datadf_merged_sales <- py$merged_sales_data#show it using reactablereactable( df_merged_sales, highlight =TRUE, # Highlight rows on hoverdefaultPageSize =10, # Display 10 rows per pagepaginationType ="numbers", # Use numbers for page navigationsearchable =TRUE, # Make the table searchablesortable =TRUE, # Allow sortingresizable =TRUE# Allow column resizing)
The 2022 sales data has been aggregated and standardized for each location. The merged dataset now shows the total sales per location for both 2022 and 2023. This dataset offers a comprehensive view of Lait Equitable’s sales dynamics over two consecutive years, highlighting trends and changes in consumer behavior across different locations.
2.2.3 Political Parties Dataset
2.2.3.1 Wrangling and Cleaning
The analysis starts by importing two datasets: sales data (annual sales of fair trade milk) and political party data (support percentages for major parties by location). The political data is cleaned to match commune names in the sales data and transformed into party presence percentages.
Next, the cleaned political data is merged with the sales data based on commune names. This merged dataset enables a combined analysis of political party presence and sales performance.
Click to show code
# Read sales data from Excelsales_data <-read_excel("../data/Ventes annuelles.xlsx")# Read political party data from Excelparty_data <-read_excel("../data/partisPolitiqueManor.xlsx")# Clean up party_data to match sales_data locationsparty_data_cleaned <- party_data %>%mutate(Location =gsub(" ", "", Location)) %>%filter(Location %in% sales_data$Location)# Update party names to match the new column namesparty_data_cleaned <- party_data_cleaned %>%mutate(PLR_Presence = PLR / (PLR + PS + UDC + Centre + Verts + Vertliberaux) *100,PS_Presence = PS / (PLR + PS + UDC + Centre + Verts + Vertliberaux) *100,UDC_Presence = UDC / (PLR + PS + UDC + Centre + Verts + Vertliberaux) *100,Centre_Presence = Centre / (PLR + PS + UDC + Centre + Verts + Vertliberaux) *100,Verts_Presence = Verts / (PLR + PS + UDC + Centre + Verts + Vertliberaux) *100,Vertliberaux_Presence = Vertliberaux / (PLR + PS + UDC + Centre + Verts + Vertliberaux) *100) %>%filter(PLR_Presence >0.2| PS_Presence >0.2| UDC_Presence >0.2| Centre_Presence >0.2| Verts_Presence >0.2| Vertliberaux_Presence >0.2)# Merge sales_data with updated party presence datamerged_data <-merge(sales_data, party_data_cleaned, by ="Location")
The analysis starts by importing one other dataset: revenue per capita per commune data. The revenue data is cleaned to match commune names in the sales data.
Next, the cleaned revenue data is merged with the sales data based on commune names. This merged dataset enables a combined analysis of revenue per capita per commune and sales performance.
Click to show code
# Load the datasetsrevenu_df <-read_excel("../data/revenuParContribuable_CommuneManor.xlsx")# Merge the datasets on the "Location" columnmerged_df <-inner_join(revenu_df, sales_data, by ="Location")# Clean the data and convert to numeric formatmerged_df$`Revenu/contribuable`<-as.numeric(gsub(" ", "", merged_df$`Revenu/contribuable`))merged_df$`2022`<-as.numeric(gsub(" ", "", merged_df$`2022`))merged_df$`2023`<-as.numeric(gsub(" ", "", merged_df$`2023`))
2.2.3.2 Description
Write kableextra to describe the dataset here
Click to show code
# Display the merged data using reactablereactable( merged_data, highlight =TRUE, # Highlight rows on hoverdefaultPageSize =10, # Display 10 rows per pagepaginationType ="numbers", # Use numbers for page navigationsearchable =TRUE, # Make the table searchablesortable =TRUE, # Allow sortingresizable =TRUE# Allow column resizing)
UTILISER CODE POUR DISPLAY DATA (JAYESH)
2.2.4 Exploratory data analysis
2.2.4.1 General analysis of dairy products market
In order to gain a better understanding of Lait Equitable sales, we’re going to start with a global analysis of the dairy market to get an idea of how milk prices work in Switzerland. The objective is to move from a macro-economic analysis to a micro-economic analysis with Fair Milk. The farm-gate price of milk is influenced by a number of factors, including milk production costs, the type of farming and the market situation in Switzerland and abroad, with demand tending to outstrip supply. The farmgate milk price is a weighted average of the prices actually paid to producers. Since 2017, the annual average farm gate milk price has risen steadily and continues its growth.
2.2.4.1.1 Distribution of swiss & international milk
There are 2 types of milk: international milk and CH milk in the column ‘Product subgroup’. It would be interesting to compare the percentages of these two types of milk in order to analyse whether the milk produced in Switzerland is significant or not.
Click to show code
# Lire les donnéesclean_data <-read.csv('../data/clean_data.csv')# Compter les occurrencesoccurrences <- clean_data %>%count(`Product.subgroup`) %>%rename(`Product.subgroup`=`Product.subgroup`, `Nombre d'occurrences`= n)# Créer l'histogramme interactif avec Plotlyfig <-plot_ly(data = occurrences, x =~`Product.subgroup`, y =~`Nombre d'occurrences`, type ='bar', text =~`Nombre d'occurrences`,textposition ='outside',hoverinfo ='text',hovertext =~paste('Frequency:', `Nombre d'occurrences`),marker =list(color =c('#87CEEB', '#90EE90')) # Couleurs douces) %>%layout(title ='Occurrence of "CH milk" and "international milk" products',xaxis =list(title ='Product subgroup'),yaxis =list(title ='Frequency'),margin =list(l =100, r =100, t =50, b =50) )# Afficher le graphiquefig
The histogram shows a higher percentage of CH milk than International Milk. This suggests that the Swiss dairy market is thriving.
2.2.4.1.2 Milk price trends by country
We are going to show price trends by country and carry out a macroeconomic analysis. The column ‘Product origin’ has 10 different origins, which are shown below.
Click to show code
# Lire les donnéesclean_data <-read.csv('../data/clean_data.csv')# Définir les valeurs à exclureregions_to_exclude <-c('Region 1', 'Region 2', 'Region 3', 'Region 4', 'Region 5')# Filtrer les données en excluant ces valeursfiltered_data <- clean_data %>%filter(!`Product.origin`%in% regions_to_exclude)# Convertir la colonne 'Date' en format Date si nécessairefiltered_data$Date <-ymd(filtered_data$Date)# Calculer le prix moyen par 'Product origin' et 'Date'average_price <- filtered_data %>%group_by(`Product.origin`, Date) %>%summarise(Average_Price =mean(Price, na.rm =TRUE), .groups ='drop') %>%ungroup()# Créer un graphique interactif avec Plotlyfig <-plot_ly(data = average_price, x =~Date, y =~Average_Price, color =~`Product.origin`, type ='scatter', mode ='lines')# Personnaliser la couleur pour la Suissefig <- fig %>%layout(title ='Average milk price by country over time',xaxis =list(title ='Year', tickformat ="%Y", dtick ="M12", tickangle =45),yaxis =list(title ='Average price'),legend =list(title =list(text ='Origin'), x =1.05, y =1),margin =list(l =50, r =50, t =50, b =50) )# Mettre en évidence la Suisse en rosefig <- fig %>%add_trace(data = average_price %>%filter(`Product.origin`=="Switzerland"),x =~Date, y =~Average_Price,type ='scatter', mode ='lines',line =list(color ='pink'),name ='Switzerland' )# Afficher le graphiquefig#> Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8#> Returning the palette you asked for with that many colors#> Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8#> Returning the palette you asked for with that many colors
We can see that Switzerland is far superior to other countries in terms of price, over the entire period observed, from 2001 to 2024. Given that Switzerland is the most expensive of all the countries listed, this is not surprising. However, the difference in centimes remains significant, up to twice the price or more.
2.2.4.2 Analysis of the Swiss dairy products market
The main objective of this project is to focus on milk producers in Switzerland and more specifically on their remuneration. Now we’re going to focus on Switzerland and its milk production. To do this, we are going to use our sub data set ‘swiss_production_data’.
2.2.4.2.1 Distribution of milk cost for Switzerland
Click to show code
# Charger le fichier CSVswiss_production_data <-read.csv('../data/swiss_production_data.csv')# Créer le graphique de densitégg_density <-ggplot(swiss_production_data, aes(x = Price)) +geom_density(fill ="skyblue", alpha =0.5) +labs(x ="Price in centimes",y ="Density",title ="Density Plot of Production Cost in Switzerland" ) +theme_minimal()# Convertir le graphique ggplot en graphique interactif avec plotlyplotly_density <-ggplotly(gg_density)# Afficher le graphique interactifplotly_density
This distribution, which covers all Swiss regions, gives an average price of around 75.25 centimes. The distribution is asymmetrical, with a more gradual rise to this value followed by an sharp fall.
Given that the sample size differs for each region, we represent the distribution of the cost of production for each region in terms of density to be able to compare them.
Click to show code
# Définir la nouvelle palette de couleurspalette <-c("red", "green", "blue", "orange", "purple")# Tri des régionsregions_sorted <-sort(unique(swiss_production_data$`Product.origin`))# Ajouter une colonne 'Color' pour assigner une couleur à chaque régionswiss_production_data <- swiss_production_data %>%mutate(Color =factor(`Product.origin`, levels = regions_sorted, labels = palette))# Créer le graphique de densitégg_density <-ggplot(swiss_production_data, aes(x = Price, color =`Product.origin`, fill =`Product.origin`)) +geom_density(alpha =0.5) +scale_fill_manual(values = palette) +scale_color_manual(values = palette) +labs(x ="Price in centimes",y ="Density",title ="Distribution of Price Density by Region",fill ="Sales regions",color ="Sales regions" ) +theme_minimal()# Convertir le graphique ggplot en graphique interactif avec plotlyplotly_density <-ggplotly(gg_density, tooltip =c("x", "y", "color"))# Afficher le graphique interactifplotly_density
The region 1 is the most expensive region in comparison to the others and the region 3 shows the lowest price in Switzerland.
2.2.4.2.2 Distribution of the production milk in kg by region
We will now analyse the distribution of kg of milk produced across the different regions. This will give us an overview of the regions with the highest production rates and therefore the most competitive regions in terms of production.
Click to show code
# Installer les packages nécessaires si ce n'est pas déjà faitif (!requireNamespace("ggplot2", quietly =TRUE)) install.packages("ggplot2")if (!requireNamespace("plotly", quietly =TRUE)) install.packages("plotly")# Charger les packageslibrary(ggplot2)library(plotly)# Compter les occurrences de chaque type de produitproduct_counts <-as.data.frame(table(swiss_production_data$`Product.origin`))names(product_counts) <-c("Product.origin", "Count")# Définir l'ordre désiré des régionsorder <-c('Region 1', 'Region 2', 'Region 3', 'Region 4', 'Region 5')# Réindexer les données dans l'ordre désiréproduct_counts$`Product.origin`<-factor(product_counts$`Product.origin`, levels = order)# Ajuster les couleurs pour les rendre plus clairesadjusted_colors <-c('Region 1'=adjustcolor('red', alpha.f =0.6),'Region 2'=adjustcolor('green', alpha.f =0.6),'Region 3'=adjustcolor('blue', alpha.f =0.6),'Region 4'=adjustcolor('orange', alpha.f =0.6),'Region 5'=adjustcolor('purple', alpha.f =0.6))# Créer un bar chart avec ggplot2p <-ggplot(product_counts, aes(x =`Product.origin`, y = Count, fill =`Product.origin`, text =paste("Product.origin:", `Product.origin`, "<br>Count:", Count))) +geom_bar(stat ="identity") +scale_fill_manual(values = adjusted_colors) +labs(title ='Distribution of milk production in Switzerland', x ='Product origin', y ='Number of Occurrences') +theme_minimal() +theme(legend.position ="none") +ylim(0, max(product_counts$Count) *1.2)# Rendre le graphique interactif avec plotlyfig <-ggplotly(p, tooltip =c("text"))# Afficher le graphiquefig
Milk production is evenly distributed between regions 1, 2 and 4. However, regions 3 and 5 show lower production, particularly region 3. This distribution of production can be explained by the location of the regions.
2.2.4.2.3 Average milk price per system of production over time
For the multivariate analysis, we’re going to focus on the production system and analyse the impact of the production system on price.
Click to show code
library(tidyverse)library(plotly)# Convertir la colonne 'Date' en datetimeswiss_production_data$Date <-as.Date(swiss_production_data$Date, format="%Y-%m-%d")# Calculer le prix moyen par système de production et par dateaverage_price <- swiss_production_data %>%group_by(`System.of.production`, Date) %>%summarise(Price =mean(Price, na.rm =TRUE), .groups ='drop')# Définir les couleurs pour chaque système de productioncolors <-c("Bio"="green", "Conventional"="orange", "Unknown"="blue")# Créer une liste de traces pour chaque système de productiontraces <-list()systems <-unique(average_price$`System.of.production`)for (system in systems) { subset <-filter(average_price, `System.of.production`== system) trace <-list(x = subset$Date,y = subset$Price,mode ='lines',name = system,type ='scatter',line =list(color = colors[system]) ) traces <-append(traces, list(trace))}# Créer la figurefig <-plot_ly(width =1000)for (trace in traces) { fig <-add_trace(fig, x = trace$x, y = trace$y, mode = trace$mode, name = trace$name, type = trace$type, line = trace$line)}# Définir les années uniques pour l'axe xyears <-unique(format(average_price$Date, "%Y"))# Ajouter des titres et des légendes, et ajuster les paramètres de l'axe x et yfig <-layout(fig,title ='Average Production Price by System of Production and Date',xaxis =list(title ='Date',tickmode ='array',tickvals =as.Date(paste0(years, "-01-01")), # Afficher toutes les annéesticktext = years,tickangle =-45,tickfont =list(size =10) ),yaxis =list(title ='Average Price in Centimes',range =c(48, 105) # Définir les limites de l'axe y ),legend =list(title =list(text ='System.of.Production')),margin =list(b =100) # Augmenter la marge inférieure pour les étiquettes pivotées)# Afficher le graphiquefig
An organic production system means higher prices as shown on the graph. The average price per kg of milk produced was higher in early 2001, around 83.5 cents for conventional milk production and almost 1.00 CHF for organic milk. The highest price recorded until now. The ‘unknown’ data are highly correlated with the ‘conventional’ data, the two curves are overlapping. There is a fall in prices until 2007, followed by a peak in 2008, which can be explained by the 2008 financial crisis. We note a seasonality and a trend patterns. The non-linear trend is upward since 2017. Does this trend will continue to go upward ? What are the future predictions for prices? We will answer this question in the Analysis section.
2.2.4.2.4 Average price of milk production, by region and production system
This histogram provides a more in-depth analysis of production systems by region.
Click to show code
# # Convertir la colonne 'Date' en années# swiss_production_data$Date <- as.numeric(format(as.Date(swiss_production_data$Date, format="%Y-%m-%d"), "%Y"))# # # Filtrer les données pour les années entre 2001 et 2024# swiss_production_data <- swiss_production_data %>%# filter(Date >= 2001 & Date <= 2024)# # # Calculer la moyenne des prix par 'Product origin', 'System of production' et 'Date'# average_price_data <- swiss_production_data %>%# group_by(`Product.origin`, `System.of.production`, Date) %>%# summarise(Avg_Price = mean(Price, na.rm = TRUE), .groups = 'drop')# # # Convertir 'Product origin' en facteur avec un ordre spécifique# average_price_data$`Product.origin` <- factor(average_price_data$`Product.origin`,# levels = c('Region 1', 'Region 2', 'Region 3', 'Region 4', 'Region 5'),# ordered = TRUE)# # # Créer le graphique animé# fig <- plot_ly(# data = average_price_data,# x = ~`Product.origin`,# y = ~Avg_Price,# color = ~`System.of.production`,# frame = ~Date,# type = 'bar'# )# # # Mettre à jour la mise en page du graphique# fig <- fig %>% layout(# title = 'Average price of milk production, by region and production system',# xaxis = list(title = 'Origin'),# yaxis = list(title = 'Average Price in centimes', range = c(0, 100)) # Ajuster la plage de l'axe des ordonnées# )# # # Ajouter les infobulles personnalisées# fig <- fig %>% style(# hovertemplate = 'Product origin: %{x}<br>Avg Price: %{y}'# )# # # Créer les étapes de l'animation pour chaque année# years <- unique(average_price_data$Date)# steps <- lapply(years, function(year) {# list(# method = 'animate',# label = as.character(year),# args = list(list(year), list(frame = list(duration = 300, redraw = TRUE), mode = 'immediate'))# )# })# # # Ajouter le slider au graphique# fig <- fig %>% layout(# sliders = list(list(steps = steps))# )# # # Supprimer les boutons du menu de mise à jour# fig <- fig %>% layout(# updatemenus = list(list(buttons = NULL))# )# # # Afficher le graphique# fig
An interesting insight is that the region 3 and 5 do not have an organic production system since 2001. They only produce conventional milk. Region 3 is once again the poorest, showing the lowest milk production at a lower price over the whole period and with only one production system.
2.2.5 Lait Equitable Products
To provide a comprehensive understanding of Lait Equitable’s sales trends throughout 2023, we performed a month-by-month sales analysis. This exploration helps identify seasonal effects, peak sales periods, and potential areas for strategic adjustments. Here’s a detailed breakdown of the approach and findings:
2.2.6 Sales Distribution Accross Months
Click to show code
# Remove 'Grand Total' column and the row labels columnmonthly_sales <- df_sales_2023 %>%select(-c(`Grand Total`, `Row Labels`))# Aggregate the sales per month across all locationstotal_sales_per_month <-colSums(monthly_sales)# Create a data frame for plottingmonthly_sales_df <-data.frame(Month =names(total_sales_per_month), Sales = total_sales_per_month)# Sort the data frame by Sales in descending ordersorted_monthly_sales_df <- monthly_sales_df %>%arrange(desc(Sales))# Plotting with ggplot2 using viridis color paletteggplot(sorted_monthly_sales_df, aes(x=reorder(Month, -Sales), y=Sales, fill=Month)) +geom_bar(stat="identity", show.legend =FALSE, fill ="#24918d", color ="black") +labs(title ="Total Sales by Month Across All Locations (2023)", x ="Month", y ="Total Sales") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
The graph shows total monthly sales across all locations for the “Lait Equitable” in 2023. It shows the month with the highest sales, March and continuing to lower sales months. The least profitable month appears to be July.
Highest Sales in March: The graph starts with March, which shows the highest sales, almost reaching 25,000 units. This suggests that March was a particularly strong month for sales, possibly due to seasonal factors or specific marketing campaigns.
Gradual Decline in Sales: As we move from left to right, there is a general trend of declining sales. After March, the next highest sales are in December, followed by April, May, and so on. This indicates that sales in March were not sustained throughout the year.
Mid-year and End-Year Trends: While the graph is not in chronological order, it shows that some months like December (typically strong due to the holiday season) also performed well, but none reached the peak seen in March.
Lower Sales in the Latter Months Displayed: The months at the right end of the graph, such as June and July, show the lowest sales figures in the year. This could indicate a seasonal dip or other market dynamics affecting these months. One supposition could be that people are on vacations at these dates due to school vacations.
2.2.7 Sales Distribution Accross Locations
Click to show code
# First, we need to remove the 'Grand Total' column if it's includeddf <- df_sales_2023[, -ncol(df_sales_2023)]# Sum sales across all months for each locationtotal_sales_by_location <- df %>%mutate(Total_Sales =rowSums(select(., -`Row Labels`))) %>%select(`Row Labels`, Total_Sales)# Sort the locations by total sales in descending ordersorted_sales_by_location <- total_sales_by_location %>%arrange(desc(Total_Sales))# Plotting the data with ggplot2ggplot(sorted_sales_by_location, aes(x=reorder(`Row Labels`, Total_Sales), y=Total_Sales, fill=`Row Labels`)) +geom_bar(stat="identity", show.legend =FALSE, fill ="#24918d", color ="black") +labs(title ="Total Sales by Location (2023)", x ="Location", y ="Total Sales") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1, vjust=0.5)) # Rotate the x-axis text for better readability
The graph illustrates the total sales by location for Lait Equitable across various stores in 2023, organized from the lowest to the highest sales volume.
Variability in Sales Across Locations: The graph displays a significant variation in sales across different locations. The left side of the graph shows locations with the least sales, starting with Chur, Rapperswil, St. Gall, and progressively increasing towards the right.
Low Sales in Certain Areas: Locations like Chur, Rapperswil, and St. Gall have notably low sales, which could indicate either a lower demand for Lait Equitable’s products in these areas or possibly less effective marketing and distribution strategies.
High Sales in Specific Locations: The right end of the graph, particularly the last five locations, shows a sharp increase in sales. Notably, Vevey, Marin-Epagnier, Sierre and Monthey exhibit high sales, with Monthey being the highest. This might indicate a stronger market presence, better consumer acceptance, or more effective promotional activities in these regions.
Potential Market Strengths and Weaknesses: The graph effectively highlights where Lait Equitable is performing well and where there might be room for improvement. For instance, the high sales in cities like Sierre and Monthey suggest strong market penetration and acceptance.
Strategic Insights: For the Lait Equitable, this graph provides crucial data points for understanding which locations might need more focused marketing efforts or adjustments in distribution strategies. Additionally, it could help in identifying successful strategies in high-performing locations that could be replicated in areas with lower sales.
Click to show code
#remove grand totaldf <- df_sales_2023[, -ncol(df_sales_2023)]# Transform the data into a long format where each row contains a location, a month, and saleslong_data <- df %>%pivot_longer(cols =-`Row Labels`, names_to ="Month", values_to ="Sales") %>%mutate(Location =`Row Labels`)# Create a plotly object for an interactive boxplotfig <-plot_ly(long_data, x =~Location, y =~Sales, type ='box',hoverinfo ='text', text =~paste('Month:', Month, '<br>Sales:', Sales),marker =list(color ="#7e57c2",boxpoints ="all",jitter =0.3),box =list(line =list(color ="#24918d")) ) %>%layout(title ="Distribution of Monthly Sales Across Locations",xaxis =list(title ="Location"),yaxis =list(title ="Monthly Sales"),showlegend =FALSE, width=600,height =800) %>%config(displayModeBar =FALSE) # Optional: hide the mode bar#> Warning: Specifying width/height in layout() is now deprecated.#> Please specify in ggplotly() or plot_ly()# Display the plotfig#> Warning: 'box' objects don't have these attributes: 'box'#> Valid attributes include:#> 'alignmentgroup', 'boxmean', 'boxpoints', 'customdata', 'customdatasrc', 'dx', 'dy', 'fillcolor', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hoveron', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'jitter', 'legendgroup', 'legendgrouptitle', 'legendrank', 'line', 'lowerfence', 'lowerfencesrc', 'marker', 'mean', 'meansrc', 'median', 'mediansrc', 'meta', 'metasrc', 'name', 'notched', 'notchspan', 'notchspansrc', 'notchwidth', 'offsetgroup', 'opacity', 'orientation', 'pointpos', 'q1', 'q1src', 'q3', 'q3src', 'quartilemethod', 'sd', 'sdsrc', 'selected', 'selectedpoints', 'showlegend', 'stream', 'text', 'textsrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'upperfence', 'upperfencesrc', 'visible', 'whiskerwidth', 'width', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
This graphs show the variability in sales across different locations for each month in 2023. The boxplot provides a visual representation of the distribution of sales figures, highlighting the range, median, and outliers in each location.
We observe that the outliers are high or low sales months as we analyze previously. It confirms the previous analysis and provides a more detailed view of the sales distribution across locations.
2.2.8 Top Performing / Worse Performing Locations
Click to show code
#using grand total to sort the data from top to bottomdf <- df_sales_2023#using 'grand total' column as total sales plot the top and bottom locationsdf %>%arrange(desc(`Grand Total`)) %>%slice_head(n =5) %>%select(`Row Labels`, `Grand Total`) %>%ggplot(aes(x =reorder(`Row Labels`, `Grand Total`), y =`Grand Total`, fill =`Row Labels`)) +geom_bar(stat ="identity", show.legend =FALSE, fill ="#33848D", color ="black") +labs(title ="Top 5 Performing Locations by Total Sales (2023)", x ="Location", y ="Total Sales") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1, vjust=0.5)) # Rotate the x-axis text for better readability# worse performing locationsdf %>%arrange(`Grand Total`) %>%slice_head(n =5) %>%select(`Row Labels`, `Grand Total`) %>%ggplot(aes(x =reorder(`Row Labels`, `Grand Total`), y =`Grand Total`, fill =`Row Labels`)) +geom_bar(stat ="identity", show.legend =FALSE, fill ="#33848D", color ="black") +labs(title ="Bottom 5 Performing Locations by Total Sales (2023)", x ="Location", y ="Total Sales") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1, vjust=0.5)) # Rotate the x-axis text for better readability
As previously analyzed, the top and bottom performing locations are displayed in the bar charts. The top 5 locations with the highest total sales are shown in the first graph, while the bottom 5 locations with the lowest total sales are displayed in the second graph.
Top-performing locations are : Monthey, Sierre, Marin-Epagnier, Vevey, and Sion. Worse-performing locations are : Basel, St. Gall, Sargans, Rapperswil, and Chur.
2.2.9 2022 vs 2023
Click to show code
#plot a bar chart to compare the total sales in 2022 and 2023 and add transparency to the barsdf_merged_sales %>%ggplot(aes(x =reorder(Location, -`Total Sales 2023`), y =`Total Sales 2023`, fill ="2023")) +geom_bar(aes(x =reorder(Location, -`Total Sales 2022`), y =`Total Sales 2022`, fill ="2022"), stat ="identity", position ="dodge", fill ="#7e57c2", color ="black", alpha =0.7 ) +geom_bar(stat ="identity", position ="dodge", fill ="#33848D", color ="black", alpha =0.7) +labs(title ="Total Sales Comparison Between 2022 and 2023 by Location", x ="Location", y ="Total Sales") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1, vjust=0.5)) # Rotate the x-axis text for better readability
Trend of Decline: A significant number of Manor locations have lower sales figures in 2023 compared to 2022. This trend suggests that Lait Equitable might be facing challenges in these areas, which could include increased competition, changing consumer preferences, or other market dynamics affecting the demand for their products.
Monthey’s Decline: The bar chart shows that Monthey experienced a substantial decrease in sales in 2023 compared to 2022. This would be a key point of concern for Lait Equitable, and understanding why Monthey is underperforming is essential. This could be due to a range of factors, such as local economic conditions, operational challenges, increased competition, or changes in consumer preference within that particular area.
2.2.10 Map
We used the datasets about Lait Equitable Sales in order to show each product sales on a map.
We took the producers locations on the fairswiss.ch website in order to display them on a map. We also displayed the sales locations of the Lait Equitable products. The map shows the distribution of Lait Equitable sales locations and the proximity of these locations to the producers. This analysis helps identify potential patterns or correlations between sales and proximity to producers.
A Heatmap was then made in order to see the concentration of producers in Switzerland
You can play with the layer on the top right of the map to see how the sales change between each location for each product
# Function to calculate the dynamic radiusdef calculate_radius(volume, max_volume, min_volume, max_radius=20): normalized_volume = (volume - min_volume) / (max_volume - min_volume)return normalized_volume * max_radius +3# Function to get latitude and longitudedef get_lat_lon(city):try: time.sleep(1) # Simple rate-limiting mechanism location = geolocator.geocode(city +', Switzerland')return location.latitude, location.longitudeexceptAttributeError:returnNone, None# Read data from different product categoriesfile_paths = {'All Products': ("../data/Produits laitiers équitables - 2023.xlsb", 'Par SM'),'Milk Drink': ("../data/lait_drink_sales_per_stores_2023.xlsx", 'Sheet1'),'Milk Entier': ("../data/lait_entier_sales_per_stores_2023.xlsx", 'Sheet1'),'Fondue': ("../data/fondue_sales_per_stores_2023.xlsx", 'Sheet1'),'Delice': ("../data/delice_sales_per_stores_2023.xlsx", 'Sheet1'),'Creme': ("../data/creme_cafe_sales_per_stores_2023.xlsx", 'Sheet1')}# Create a folium mapm = folium.Map(location=[46.8182, 8.2275], zoom_start=8)# Instantiate the geolocatorgeolocator = Nominatim(user_agent="le_stores")#####map for store locations per products# Loop through each category for category, (file_path, sheet_name) in file_paths.items(): engine ='pyxlsb'if'xlsb'in file_path elseNone df = pd.read_excel(file_path, engine=engine, sheet_name=sheet_name)if category =='All Products':# Skip the first six rows and rename columns based on the provided structure df = df.iloc[6:] df.rename(columns={'Quantités vendues - année 2023': 'City','Unnamed: 1': '01/01/2023','Unnamed: 2': '02/01/2023','Unnamed: 3': '03/01/2023','Unnamed: 4': '04/01/2023','Unnamed: 5': '05/01/2023','Unnamed: 6': '06/01/2023','Unnamed: 7': '07/01/2023','Unnamed: 8': '08/01/2023','Unnamed: 9': '09/01/2023','Unnamed: 10': '10/01/2023','Unnamed: 11': '11/01/2023','Unnamed: 12': '12/01/2023','Unnamed: 13': 'Total General' }, inplace=True)else:# Renaming columns for XLSX files based on your last dataframe example df.rename(columns={ df.columns[0]: 'City', df.columns[-1]: 'Total General' }, inplace=True)# Standardize city names correct_city_names = {'Bâle': 'Basel','Genève': 'Geneva','Bienne': 'Biel/Bienne','Chavannes': 'Chavannes-de-Bogis','Marin': 'Marin-Epagnier','Vesenaz': 'Vésenaz','Yverdon': 'Yverdon-les-Bains','Saint-Gall Webersbleiche': 'St. Gall' } df['City'] = df['City'].apply(lambda x: correct_city_names.get(x, x))# Get latitudes and longitudes df[['Lat', 'Lon']] = df.apply(lambda row: pd.Series(get_lat_lon(row['City'])), axis=1)# Define color scale and feature group max_sales = df['Total General'].max() min_sales = df['Total General'].min() color_scale =cmp.linear.viridis.scale(min_sales, max_sales) fg = folium.FeatureGroup(name=category)# Add markersfor index, row in df.iterrows():if pd.notnull(row['Lat']) and pd.notnull(row['Lon']): radius = calculate_radius(row['Total General'], max_sales, min_sales) folium.CircleMarker( location=[row['Lat'], row['Lon']], radius=radius, popup=f"{row['City']}: {row['Total General']}", color=color_scale(row['Total General']), fill=True, fill_color=color_scale(row['Total General']) ).add_to(fg) fg.add_to(m)#> <folium.vector_layers.CircleMarker object at 0x00000209D3DF87A0>#> <folium.vector_layers.CircleMarker object at 0x00000209D1BDFB90>#> <folium.vector_layers.CircleMarker object at 0x00000209BF04CE00>#> <folium.vector_layers.CircleMarker object at 0x00000209D1BF6330>#> <folium.vector_layers.CircleMarker object at 0x00000209D4365C70>#> <folium.vector_layers.CircleMarker object at 0x00000209D4367EF0>#> <folium.vector_layers.CircleMarker object at 0x00000209D4367980>#> <folium.vector_layers.CircleMarker object at 0x00000209D43677D0>#> <folium.vector_layers.CircleMarker object at 0x00000209D4365490>#> <folium.vector_layers.CircleMarker object at 0x00000209D4366570>#> <folium.vector_layers.CircleMarker object at 0x00000209D4366540>#> <folium.vector_layers.CircleMarker object at 0x00000209D43646E0>#> <folium.vector_layers.CircleMarker object at 0x00000209D4364230>#> <folium.vector_layers.CircleMarker object at 0x00000209D4366870>#> <folium.vector_layers.CircleMarker object at 0x00000209D4367260>#> <folium.vector_layers.CircleMarker object at 0x00000209D43645F0>#> <folium.vector_layers.CircleMarker object at 0x00000209D4366420>#> <folium.vector_layers.CircleMarker object at 0x00000209D4365550>#> <folium.vector_layers.CircleMarker object at 0x00000209D19CA2A0>#> <folium.vector_layers.CircleMarker object at 0x00000209BEABD100>#> <folium.vector_layers.CircleMarker object at 0x00000209D427D220>#> <folium.vector_layers.CircleMarker object at 0x00000209D4367920>#> <folium.vector_layers.CircleMarker object at 0x00000209D1C22810>#> <folium.vector_layers.CircleMarker object at 0x00000209D4216780>#> <folium.vector_layers.CircleMarker object at 0x00000209D4216B40>#> <folium.vector_layers.CircleMarker object at 0x00000209D4228DA0>#> <folium.map.FeatureGroup object at 0x00000209C02591F0>#> <folium.vector_layers.CircleMarker object at 0x00000209D4301130>#> <folium.vector_layers.CircleMarker object at 0x00000209D1BDFB60>#> <folium.vector_layers.CircleMarker object at 0x00000209D3DC6FC0>#> <folium.vector_layers.CircleMarker object at 0x00000209D42CEE40>#> <folium.vector_layers.CircleMarker object at 0x00000209D164CE90>#> <folium.vector_layers.CircleMarker object at 0x00000209C0258560>#> <folium.vector_layers.CircleMarker object at 0x00000209D4364D10>#> <folium.vector_layers.CircleMarker object at 0x00000209D4365820>#> <folium.vector_layers.CircleMarker object at 0x00000209D4364EC0>#> <folium.vector_layers.CircleMarker object at 0x00000209D4364C50>#> <folium.vector_layers.CircleMarker object at 0x00000209D6385A00>#> <folium.vector_layers.CircleMarker object at 0x00000209D6385E50>#> <folium.vector_layers.CircleMarker object at 0x00000209D6385EE0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6385CD0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6386180>#> <folium.vector_layers.CircleMarker object at 0x00000209D6386AB0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63862A0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63860F0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6386930>#> <folium.vector_layers.CircleMarker object at 0x00000209D6386B40>#> <folium.vector_layers.CircleMarker object at 0x00000209D6386BA0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6386A50>#> <folium.vector_layers.CircleMarker object at 0x00000209D6386CC0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6386F30>#> <folium.vector_layers.CircleMarker object at 0x00000209D6387110>#> <folium.vector_layers.CircleMarker object at 0x00000209D6387260>#> <folium.vector_layers.CircleMarker object at 0x00000209D6386DE0>#> <folium.map.FeatureGroup object at 0x00000209D43178F0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E8860>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E8DD0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E9730>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E8A10>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E9940>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E8FB0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E8E00>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E90A0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB290>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E9490>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E9850>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EAE70>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E9670>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB680>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EAFC0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E9A30>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB170>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EBF50>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB050>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB0B0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB4D0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB740>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB530>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB950>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EBC20>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EBAD0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EA630>#> <folium.map.FeatureGroup object at 0x00000209D63E9130>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EA6C0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EA090>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E9010>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E9D00>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EA660>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EA870>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E8590>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BEF00>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BF890>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BF8C0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BF140>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BFB60>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BF740>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BC680>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BC6B0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BC080>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BF470>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BC4D0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BC0B0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BC2C0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BECF0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BC050>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BC440>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BC2F0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BCE60>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BD340>#> <folium.map.FeatureGroup object at 0x00000209D63E9610>#> <folium.vector_layers.CircleMarker object at 0x00000209D3DFB230>#> <folium.vector_layers.CircleMarker object at 0x00000209D3DED1C0>#> <folium.vector_layers.CircleMarker object at 0x00000209D4302690>#> <folium.vector_layers.CircleMarker object at 0x00000209D63E8D40>#> <folium.vector_layers.CircleMarker object at 0x00000209D1BDFBF0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EA030>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DD880>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DDF40>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DDD90>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DF530>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DDD60>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DE300>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DDDC0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DE030>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DE060>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DE4B0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DEF90>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DEED0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DE8A0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DED80>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DE8D0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DE930>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DEE40>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DF020>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DF5F0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DFC80>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DF0E0>#> <folium.map.FeatureGroup object at 0x00000209D6387CB0>#> <folium.vector_layers.CircleMarker object at 0x00000209C025B3B0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63BD1C0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DF200>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DDD00>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DD310>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DD850>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DDAC0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63DDA00>#> <folium.vector_layers.CircleMarker object at 0x00000209D65629F0>#> <folium.vector_layers.CircleMarker object at 0x00000209D65632C0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6562990>#> <folium.vector_layers.CircleMarker object at 0x00000209D6562C90>#> <folium.vector_layers.CircleMarker object at 0x00000209D6562E40>#> <folium.vector_layers.CircleMarker object at 0x00000209D6563110>#> <folium.vector_layers.CircleMarker object at 0x00000209D6562AB0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6562BD0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6562ED0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6563CB0>#> <folium.vector_layers.CircleMarker object at 0x00000209D65637D0>#> <folium.vector_layers.CircleMarker object at 0x00000209D6563290>#> <folium.vector_layers.CircleMarker object at 0x00000209D6563C20>#> <folium.vector_layers.CircleMarker object at 0x00000209D6563440>#> <folium.vector_layers.CircleMarker object at 0x00000209D6563A10>#> <folium.vector_layers.CircleMarker object at 0x00000209D6563800>#> <folium.vector_layers.CircleMarker object at 0x00000209D6563A40>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EA4E0>#> <folium.vector_layers.CircleMarker object at 0x00000209D63EB770>#> <folium.map.FeatureGroup object at 0x00000209D6387D70>###heatmap of producersheat_data = [[float(lat), float(lon)] for loc in locations for lat, lon in [loc.split(',')]]# Add HeatMap layerHeatMap(heat_data).add_to(m)#> <folium.plugins.heat_map.HeatMap object at 0x00000209D63DD700># Add layer control and save the mapfolium.LayerControl().add_to(m)#> <folium.map.LayerControl object at 0x00000209D65621E0>m.save('combined_product_map.html')m
Make this Notebook Trusted to load map: File -> Trust Notebook
2.3 Price To Producers Lait Cru
2.3.1 Organic Milk vs Non Organic (bio) Milk
Click to show code
# Create xts objectprices_xts <-xts(df_producteur[, c("prix_bio", "prix_non_bio")], order.by = df_producteur$date)# Plot using dygraphsdygraph(prices_xts, main ="Trends in Milk Prices (Organic vs. Non-Organic)", width ="600px", height ="400px") %>%dySeries("prix_bio", label ="Organic Price", color ="#24918d") %>%dySeries("prix_non_bio", label ="Non-Organic Price", color ="#7e57c2") %>%dyOptions(stackedGraph =FALSE) %>%dyRangeSelector(height =20)
Click to show code
# Create an xts object for the delta series, ensuring the series name is retaineddelta_xts <-xts(x = df_producteur[,"delta", drop =FALSE], order.by = df_producteur$date)# Plot using dygraphsdf_p_delta <-dygraph(delta_xts, main ="Difference in Prices Between Organic and Non-Organic Milk Over Time", width ="600px", height ="400px") %>%dySeries("delta", label ="Delta in Price", color ="#24918d") %>%dyOptions(stackedGraph =FALSE) %>%dyRangeSelector(height =20)# Print the dygraph to display itp_delta
2.3.2 Seasonality
Click to show code
# Process the data to extract month and yeardf_producteur <- df_producteur %>%mutate(Month =format(date, "%m"),Year =format(date, "%Y")) %>%arrange(date) # Ensure data is in chronological order# Plotting the data with ggplot2, showing the trend within each yearp_seaso_2 <-ggplot(df_producteur, aes(x = Month, y = prix_bio, group = Year, color =as.factor(Year))) +geom_smooth(se =FALSE, method ="loess", span =0.3, size =0.7) +labs(title ="Monthly Milk Prices by Year",x ="Month",y ="Price of Organic Milk",color ="Year") +theme_minimal() +scale_color_viridis_d() +theme(legend.position ="bottom", axis.text.x =element_text(angle =45, hjust =1))#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.#> i Please use `linewidth` instead.# Convert to an interactive plotly objectinteractive_plot_seaso_2 <-ggplotly(p_seaso_2, width =600, height =400)#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : span too small. fewer data values than degrees of#> freedom.#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : at 6.975#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : radius 0.000625#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : all data on boundary of neighborhood. make span bigger#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : pseudoinverse used at 6.975#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : neighborhood radius 0.025#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : reciprocal condition number 1#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : at 12.025#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : radius 0.000625#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : all data on boundary of neighborhood. make span bigger#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : There are other near singularities as well. 0.000625#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : zero-width neighborhood. make span bigger#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : zero-width neighborhood. make span bigger#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : zero-width neighborhood. make span bigger#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : zero-width neighborhood. make span bigger#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : zero-width neighborhood. make span bigger#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : zero-width neighborhood. make span bigger#> Warning: Failed to fit group 1.#> Caused by error in `predLoess()`:#> ! NA/NaN/Inf in foreign function call (arg 5)#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : span too small. fewer data values than degrees of#> freedom.#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : pseudoinverse used at 0.945#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : neighborhood radius 2.055#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : reciprocal condition number 0#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : There are other near singularities as well. 4.223#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : span too small. fewer data values than degrees of#> freedom.#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : pseudoinverse used at 0.945#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : neighborhood radius 2.055#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : reciprocal condition number 0#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : There are other near singularities as well. 4.223#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : span too small. fewer data values than degrees of#> freedom.#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : pseudoinverse used at 0.945#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : neighborhood radius 2.055#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : reciprocal condition number 0#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : There are other near singularities as well. 4.223#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : span too small. fewer data values than degrees of#> freedom.#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : pseudoinverse used at 0.945#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : neighborhood radius 2.055#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : reciprocal condition number 0#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : There are other near singularities as well. 4.223#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : span too small. fewer data values than degrees of#> freedom.#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : pseudoinverse used at 0.945#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : neighborhood radius 2.055#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : reciprocal condition number 0#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : There are other near singularities as well. 4.223#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : span too small. fewer data values than degrees of#> freedom.#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : pseudoinverse used at 0.95#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : neighborhood radius 2.05#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : reciprocal condition number 0#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =#> parametric, : There are other near singularities as well. 4.2025# Adjust plotly settings interactive_plot_seaso_2 <- interactive_plot_seaso_2 %>%layout(margin =list(l =40, r =10, b =40, t =40), # Adjust marginslegend =list(orientation ="h", x =0, xanchor ="left", y =-0.2)) # Adjust legend position# Display the interactive plotinteractive_plot_seaso_2
3 Analysis
Answers to the research questions
Different methods considered
Competing approaches
Justifications
4 Decomposition of Milk Price in Switzerland
Click to show code
# Charger les packages nécessaireslibrary(tidyverse)library(lubridate)library(forecast)library(ggplot2)# Importer les donnéesswiss_decomposition <-read_csv('../data/swiss_production_data.csv', show_col_types =FALSE)
Click to show code
# Assurez-vous que les dates sont de type Date et que 'Date' est l'indexswiss_decomposition <- swiss_decomposition %>%mutate(Date =as.Date(Date, format ="%Y-%m-%d")) %>%arrange(Date)# Calculez la moyenne des prix par datedata <- swiss_decomposition %>%group_by(Date =floor_date(Date, "month")) %>%summarise(Price =mean(Price, na.rm =TRUE))# Convertir en série temporelledata_ts <-ts(data$Price, start =c(year(min(data$Date)), month(min(data$Date))), frequency =12)# Appliquez la décomposition classiquedecomposition <-decompose(data_ts, type ="additive") # ou 'multiplicative' selon le cas# Créez un graphique avec des subplotspar(mfrow =c(4, 1), mar =c(3, 3, 2, 1), oma =c(1, 1, 1, 1))# Plot the original dataplot(data_ts, main ="Original Data", ylab ="Prix")# Plot the trend componentplot(decomposition$trend, main ="Trend Component", ylab ="Trend")# Plot the seasonal componentplot(decomposition$seasonal, main ="Seasonal Component", ylab ="Seasonal")# Plot the residual componentplot(decomposition$random, main ="Residual Component", ylab ="Residual")# Ajuster le layoutpar(mfrow =c(1, 1))
4.1 Overall SARIMA forecast
Click to show code
# Assurez-vous que les dates sont de type Date et que 'Date' est l'indexlibrary(dplyr)library(lubridate)library(forecast)library(ggplot2)swiss_decomposition <- swiss_decomposition %>%mutate(Date =as.Date(Date, format ="%Y-%m-%d")) %>%arrange(Date)# Calculez la moyenne des prix par date (par mois)data <- swiss_decomposition %>%group_by(Date) %>%summarise(Price =mean(Price, na.rm =TRUE), .groups ='drop')# Convertir en série temporelledata_ts <-ts(data$Price, start =c(year(min(data$Date)), month(min(data$Date))), frequency =12)# Ajustement du modèle SARIMA avec auto.arima pour sélectionner les meilleurs paramètressarima_model <-auto.arima(data_ts, seasonal =TRUE, stepwise =FALSE, approximation =FALSE)# Prévision avec intervalles de confiance de 80 % et 95 %sarima_forecast <-forecast(sarima_model, h =12, level =c(80, 95))# Créer des dataframes pour les prévisions et les intervalles de confiancedata_forecast <-data.frame(Date =seq.Date(from =max(data$Date) +months(1), by ="month", length.out =12),Forecast =as.numeric(sarima_forecast$mean),Lower80 = sarima_forecast$lower[,1],Upper80 = sarima_forecast$upper[,1],Lower95 = sarima_forecast$lower[,2],Upper95 = sarima_forecast$upper[,2])# Tracer les résultats avec ggplot2ggplot() +geom_line(data = data, aes(x = Date, y = Price), color ="blue", size =0.5) +geom_line(data = data_forecast, aes(x = Date, y = Forecast), color ="red", size =0.5) +geom_ribbon(data = data_forecast, aes(x = Date, ymin = Lower80, ymax = Upper80), fill ="blue", alpha =0.2) +geom_ribbon(data = data_forecast, aes(x = Date, ymin = Lower95, ymax = Upper95), fill ="blue", alpha =0.1) +scale_x_date(date_breaks ="5 years", date_labels ="%Y") +labs(title ='Overall SARIMA Forecast', x ='Date', y ='Average Price') +theme_minimal() +theme(plot.title =element_text(hjust =0.5))
The SARIMA model used provides a good approximation of historical trends and provides reasonable forecasts for 2024. However, there are a few points to consider: - relative price stability in the forecast despite historical variability. - confidence intervals show that there is some uncertainty around forecasts, indicating that it is prudent to consider a range of possible outcomes.
4.2 SARIMA Forecast for “Bio” and “Conventional”
We saw that our data showed a significant seasonal trend. The SARIMA model takes into account the seasonality of time series in its predictions. This is why the SARIMA model would be better suited to our data.
Click to show code
# Assurez-vous que les dates sont de type Date et que 'Date' est l'indexswiss_decomposition <- swiss_decomposition %>%mutate(Date =as.Date(Date, format ="%Y-%m-%d")) %>%arrange(Date)# Filtrer les données pour les systèmes de production 'Bio' et 'Conventional'data_bio <- swiss_decomposition %>%filter(`System of production`=='Bio') %>%group_by(Date =floor_date(Date, "month")) %>%summarise(Price =mean(Price, na.rm =TRUE)) %>%ungroup()data_conventional <- swiss_decomposition %>%filter(`System of production`=='Conventional') %>%group_by(Date =floor_date(Date, "month")) %>%summarise(Price =mean(Price, na.rm =TRUE)) %>%ungroup()# Convertir en série temporellets_bio <-ts(data_bio$Price, start =c(year(min(data_bio$Date)), month(min(data_bio$Date))), frequency =12)ts_conventional <-ts(data_conventional$Price, start =c(year(min(data_conventional$Date)), month(min(data_conventional$Date))), frequency =12)# Ajuster les modèles SARIMAfit_bio_sarima <-auto.arima(ts_bio, seasonal =TRUE, stepwise =FALSE, approximation =FALSE)fit_conventional_sarima <-auto.arima(ts_conventional, seasonal =TRUE, stepwise =FALSE, approximation =FALSE)# Prévisions pour les 12 prochains moisforecast_bio_sarima <-forecast(fit_bio_sarima, h =12)forecast_conventional_sarima <-forecast(fit_conventional_sarima, h =12)# Tracer les prévisionsautoplot(forecast_bio_sarima) +labs(title ="Forecasted Prices of Organic Milk (SARIMA)", x ="Date", y ="Average Price") +theme_minimal()autoplot(forecast_conventional_sarima) +labs(title ="Forecasted Prices of Non-Organic Milk (SARIMA)", x ="Date", y ="Average Price") +theme_minimal()
For Organic Milk, price forecasts show a stable but slightly rising general trend. For Conventional Milk, price forecast shows a slight initial decline followed by stabilization. Confidence intervals for both types of milk indicate similar uncertainty around forecasts, with areas of moderate and high uncertainty.
5 Exponential smoothing
Click to show code
# Charger les packages nécessaireslibrary(tidyverse)library(lubridate)library(forecast)library(ggplot2)# Filtrer les données pour les systèmes de production 'Bio' et 'Conventional'data_bio <- swiss_decomposition %>%filter(`System of production`=='Bio') %>%group_by(Date =floor_date(Date, "month")) %>%summarise(Price =mean(Price, na.rm =TRUE)) %>%ungroup()data_conventional <- swiss_decomposition %>%filter(`System of production`=='Conventional') %>%group_by(Date =floor_date(Date, "month")) %>%summarise(Price =mean(Price, na.rm =TRUE)) %>%ungroup()# Convertir en série temporellets_bio <-ts(data_bio$Price, start =c(year(min(data_bio$Date)), month(min(data_bio$Date))), frequency =12)ts_conventional <-ts(data_conventional$Price, start =c(year(min(data_conventional$Date)), month(min(data_conventional$Date))), frequency =12)# Ajuster les modèles de lissage exponentiel (ETS)fit_bio_ets <-ets(ts_bio)fit_conventional_ets <-ets(ts_conventional)# Prévisions pour les 12 prochains moisforecast_bio_ets <-forecast(fit_bio_ets, h =12)forecast_conventional_ets <-forecast(fit_conventional_ets, h =12)# Tracer les prévisionsautoplot(forecast_bio_ets) +labs(title ="Forecasted Prices of Organic Milk (ETS)", x ="Date", y ="Average Price") +theme_minimal()autoplot(forecast_conventional_ets) +labs(title ="Forecasted Prices of Non-Organic Milk (ETS)", x ="Date", y ="Average Price") +theme_minimal()
For organic milk, the price forecasts show a stable overall trend, with a slight increase. For non organic milk, the price Forecasts show a slight initial fall, followed by stabilization.
6 Holt-Winters - a voir si on garde
We apply the Holt-Winters model to predict the price over the next 12 months. This model is an extension of the simple exponential smoothing model. It takes trend and seasonality into account in its predictions for its two time series, since these two components play a crucial role as we have seen with the decomposition. We took seasonal = “multiplicative” because the multiplicative model shows lower AIC values, indicating a better fit than additive models.
Click to show code
# Charger les packages nécessaireslibrary(forecast)library(ggplot2)# Convertir en série temporellets_bio <-ts(data_bio$Price, start =c(year(min(data_bio$Date)), month(min(data_bio$Date))), frequency =12)ts_conventional <-ts(data_conventional$Price, start =c(year(min(data_conventional$Date)), month(min(data_conventional$Date))), frequency =12)# Ajuster le modèle de Holt-Winters (multiplicatif pour la saisonnalité multiplicative)fit_bio_hw <-hw(ts_bio, seasonal ="multiplicative")fit_conventional_hw <-hw(ts_conventional, seasonal ="multiplicative")# Prévisions pour les 12 prochains moisforecast_bio_hw <-forecast(fit_bio_hw, h =12)forecast_conventional_hw <-forecast(fit_conventional_hw, h =12)# Tracer les prévisionsautoplot(forecast_bio_hw) +labs(title ="Forecasted Prices of Organic Milk (Holt-Winters Multiplicative)", x ="Date", y ="Average Price") +theme_minimal()autoplot(forecast_conventional_hw) +labs(title ="Forecasted Prices of Non-Organic Milk (Holt-Winters Multiplicative)", x ="Date", y ="Average Price") +theme_minimal()
The Holt-Winters model predicts a higher price increase for organic milk. According to this graph, the price would rise to CHF 1 for the producer. The confidence interval rises to CHF 1.10, but does not fall below 87 centimes. For conventional milk, the forecasts fall less and rise more compared with the ETS model. For organic milk, the price forecasts show a stable overall trend, with a slight increase. For non organic milk, the price Forecasts show a slight initial fall, followed by stabilization.
7 Comparison of all forecastings
All four models effectively capture historical trends in organic and non-organic milk prices. Forecasts for 2024 are similar between the models, indicating upward trends for organic milk and stabilization for non-organic milk. Confidence intervals for all four models are similar, indicating moderate to high uncertainty. We observe that there are no major differences between the SARIMA, SARIMAX, ETS, and Holt-Winters Multiplicative models in the organic and non-organic milk price forecasts for 2024.
7.1 Forecasting Next Year Milk Prices
Click to show code
# re-arragen the df_producteur data in ascending orderdf_producteur <- df_producteur[order(df_producteur$date),]#creating tsibble for organic and non-organic milk pricesdf_producteur_ts_non_bio <-ts(df_producteur$prix_non_bio, start=c(2017, 12), frequency=12)df_producteur_ts_bio <-ts(df_producteur$prix_bio, start=c(2017, 12), frequency=12)#convert the ts object to a tsiible objectdf_producteur_ts_non_bio <-as_tsibble(df_producteur_ts_non_bio)df_producteur_ts_bio <-as_tsibble(df_producteur_ts_bio)
7.1.1 Naive Forecast
Click to show code
# Fit a naive modelfit_non_bio <- df_producteur_ts_non_bio %>%model(naive =NAIVE(value))fit_bio <- df_producteur_ts_bio %>%model(naive =NAIVE(value))# Forecast the next 12 monthsnaive_forecast_non_bio <- fit_non_bio %>%forecast(h =12)naive_forecast_bio <- fit_bio %>%forecast(h =12)plot <- naive_forecast_non_bio %>%autoplot(df_producteur_ts_non_bio, alpha =0.5) +labs(title ="Naive Forecast of Non-Organic Milk Prices",x ="Date",y ="Price") +guides(colour =guide_legend(title ="Forecast"))plotplot <- naive_forecast_bio %>%autoplot(df_producteur_ts_bio, alpha =0.5) +labs(title ="Naive Forecast of Non-Organic Milk Prices",x ="Date",y ="Price") +guides(colour =guide_legend(title ="Forecast"))plot
We observe that this model is very vague because is just a naive model that assumes that the next value will be the same as the last value. But it gives us a starting point to compare with other models.
7.1.2 ARIMA Model
7.1.2.1 Stationarity
Click to show code
# re-arragen the df_producteur data in ascending orderdf_producteur <- df_producteur[order(df_producteur$date),]#creating tsibble for organic and non-organic milk pricesdf_producteur_ts_non_bio <-ts(df_producteur$prix_non_bio, start=c(2017, 12), frequency=12)df_producteur_ts_bio <-ts(df_producteur$prix_bio, start=c(2017, 12), frequency=12)#check for stationarityadf.test(df_producteur_ts_non_bio)#> #> Augmented Dickey-Fuller Test#> #> data: df_producteur_ts_non_bio#> Dickey-Fuller = -3, Lag order = 4, p-value = 0.08#> alternative hypothesis: stationaryadf.test(df_producteur_ts_bio)#> #> Augmented Dickey-Fuller Test#> #> data: df_producteur_ts_bio#> Dickey-Fuller = -3, Lag order = 4, p-value = 0.08#> alternative hypothesis: stationary
We analyse the stationarity of the time series data for both organic and non-organic milk prices using the Augmented Dickey-Fuller (ADF) test. The Augmented Dickey-Fuller (ADF) test is commonly used to determine whether a unit root is present in a time series dataset. A unit root suggests that a time series is non-stationary, meaning its statistical properties such as mean and variance change over time. On the other hand, if the null hypothesis of the ADF test is rejected, it indicates that the time series is stationary.
We do that because ARIMA models require the time series data to be stationary.
The results of the ADF test for both time series df_producteur_ts_non_bio and df_producteur_ts_bio indicate:
Dickey-Fuller statistic value of -3
Lag order of 4
p-value of 0.08
Solely based on the p-values provided (0.08), we cannot conclusively determine whether the time series data df_producteur_ts_non_bio and df_producteur_ts_bio are stationary or not. They might be stationary, but further analysis or additional tests might be needed for a more definitive conclusion.
We can thus differenciate the data to make it stationary.
Click to show code
#difference the time seriesdf_producteur_ts_non_bio_diff <-diff(df_producteur_ts_non_bio)df_producteur_ts_bio_diff <-diff(df_producteur_ts_bio)#plot them to see the differentiationautoplot(df_producteur_ts_non_bio_diff)+labs(title ="Differenced Time Series of Organic Milk Prices")autoplot(df_producteur_ts_bio_diff) +labs(title ="Differenced Time Series of Bio Milk Prices")#check for stationarityadf.test(df_producteur_ts_non_bio_diff)#> Warning in adf.test(df_producteur_ts_non_bio_diff): p-value smaller#> than printed p-value#> #> Augmented Dickey-Fuller Test#> #> data: df_producteur_ts_non_bio_diff#> Dickey-Fuller = -6, Lag order = 4, p-value = 0.01#> alternative hypothesis: stationaryadf.test(df_producteur_ts_bio_diff)#> Warning in adf.test(df_producteur_ts_bio_diff): p-value smaller than#> printed p-value#> #> Augmented Dickey-Fuller Test#> #> data: df_producteur_ts_bio_diff#> Dickey-Fuller = -6, Lag order = 4, p-value = 0.01#> alternative hypothesis: stationary
The results of the ADF test for both differenced time series indicate:
Dickey-Fuller statistic value of -6
Lag order of 4
p-value of 0.01
In this case, the p-value is smaller than the typical significance level of 0.05, indicating strong evidence against the null hypothesis. Therefore, based on the p-values provided (0.01), we can conclude that the differenced time series data are likely stationary.
This suggests that after differencing, the time series data df_producteur_ts_non_bio and df_producteur_ts_bio have become stationary, which is often desirable for various time series analysis techniques and forecasting models.
7.1.2.2 Fitting the ARIMA Model and Forecasting
Click to show code
# Fit the ARIMA modelfit_non_bio <-auto.arima(df_producteur_ts_non_bio, seasonal =FALSE)fit_bio <-auto.arima(df_producteur_ts_bio, seasonal =FALSE)# Forecast the next 12 monthsforecast_non_bio <-forecast(fit_non_bio, h =12)forecast_bio <-forecast(fit_bio, h =12)#show the components used for the ARIMA modelfit_non_bio %>%summary()#> Series: df_producteur_ts_non_bio #> ARIMA(2,1,1) with drift #> #> Coefficients:#> ar1 ar2 ma1 drift#> 1.413 -0.636 -0.908 0.193#> s.e. 0.095 0.085 0.083 0.059#> #> sigma^2 = 1.11: log likelihood = -110#> AIC=231 AICc=232 BIC=242#> #> Training set error measures:#> ME RMSE MAE MPE MAPE MASE ACF1#> Training set -0.0319 1.02 0.789 -0.078 1.16 0.283 -0.118fit_bio %>%summary()#> Series: df_producteur_ts_bio #> ARIMA(0,1,1) #> #> Coefficients:#> ma1#> 0.616#> s.e. 0.088#> #> sigma^2 = 6.39: log likelihood = -178#> AIC=360 AICc=360 BIC=365#> #> Training set error measures:#> ME RMSE MAE MPE MAPE MASE ACF1#> Training set 0.0694 2.5 1.86 0.0626 2.18 0.677 -0.00305#plot the forecasted valuesautoplot(forecast_non_bio) +labs(title ="Forecasted Prices of Non-Organic Milk")
Click to show code
autoplot(forecast_bio) +labs(title ="Forecasted Prices of Organic Milk")
7.1.2.3 Fit a SARIMA Model
Click to show code
# Fit the SARIMA modelfit_non_bio_sarima <-auto.arima(df_producteur_ts_non_bio, seasonal =TRUE, stepwise =FALSE, approximation =FALSE)fit_bio_sarima <-auto.arima(df_producteur_ts_bio, seasonal =TRUE, stepwise =FALSE, approximation =FALSE)# Forecast the next 12 monthsforecast_non_bio_sarima <-forecast(fit_non_bio_sarima, h =12)forecast_bio_sarima <-forecast(fit_bio_sarima, h =12)#plot the forecasted valuesautoplot(forecast_non_bio_sarima) +labs(title ="Forecasted Prices of Non-Organic Milk (SARIMA)")autoplot(forecast_bio_sarima) +labs(title ="Forecasted Prices of Organic Milk (SARIMA)")
7.1.2.4 Compare ARIMA and SARIMA forecast
7.1.2.4.1 Organic Milk SARIMA vs ARIMA
Click to show code
# compare forecast_bio vs forecast_bio_sarima Model using AIC
7.1.2.4.2 Non-Organic Milk SARIMA vs ARIMA
Click to show code
# compare forecast_non_bio vs forecast_non_bio_sarima using AIC
7.1.2.5 Forecasted Prices ARIMA
Click to show code
# Create a table of the forecasted values forecast_table_arima <-tibble(Month =seq(as.Date("2023-01-01"), by ="month", length.out =12),Non_Organic_Forecast = forecast_non_bio$mean,Bio_Forecast = forecast_bio$mean)#round the forecasted valuesforecast_table_arima <- forecast_table_arima %>%mutate(across(c(Non_Organic_Forecast, Bio_Forecast), ~round(., 2)))#show the forecasted values using reactablereactable( forecast_table_arima, highlight =TRUE, # Highlight rows on hoverdefaultPageSize =10, # Display 10 rows per pagepaginationType ="numbers", # Use numbers for page navigationsearchable =TRUE, # Make the table searchablesortable =TRUE, # Allow sortingresizable =TRUE# Allow column resizing)
Click to show code
#plot the forecasted valuesforecast_table_arima %>%pivot_longer(cols =c(Non_Organic_Forecast, Bio_Forecast), names_to ="Type", values_to ="Forecasted_Price") %>%ggplot(aes(x = Month, y = Forecasted_Price, color = Type)) +geom_line() +labs(title ="Forecasted Prices of Organic and Non-Organic Milk",x ="Month",y ="Price",color ="Type") +theme_minimal()
We used the mean values of the forecasted prices for both organic and non-organic milk to create a table and plot the forecasted prices for the next 12 months. The table provides a detailed view of the forecasted prices, while the plot visualizes the trend of the forecasted prices over time.
7.1.2.6 Forecasted Prices SARIMA
Click to show code
# Create a table of the forecasted values forecast_table_sarima <-tibble(Month =seq(as.Date("2023-01-01"), by ="month", length.out =12),Non_Organic_Forecast = forecast_non_bio_sarima$mean,Bio_Forecast = forecast_bio_sarima$mean)#round the forecasted valuesforecast_table_sarima <- forecast_table_sarima %>%mutate(across(c(Non_Organic_Forecast, Bio_Forecast), ~round(., 2)))#show the forecasted values using reactablereactable( forecast_table_sarima, highlight =TRUE, # Highlight rows on hoverdefaultPageSize =10, # Display 10 rows per pagepaginationType ="numbers", # Use numbers for page navigationsearchable =TRUE, # Make the table searchablesortable =TRUE, # Allow sortingresizable =TRUE# Allow column resizing)
Click to show code
#plot the forecasted valuesforecast_table_sarima %>%pivot_longer(cols =c(Non_Organic_Forecast, Bio_Forecast), names_to ="Type", values_to ="Forecasted_Price") %>%ggplot(aes(x = Month, y = Forecasted_Price, color = Type)) +geom_line() +labs(title ="Forecasted Prices of Organic and Non-Organic Milk",x ="Month",y ="Price",color ="Type") +theme_minimal()
7.1.3 Exponential Smoothing
Click to show code
# Fit the ETS modelfit_non_bio_ets <-ets(df_producteur_ts_non_bio)fit_bio_ets <-ets(df_producteur_ts_bio)# Forecast the next 12 monthsforecast_non_bio_ets <-forecast(fit_non_bio_ets, h =12)forecast_bio_ets <-forecast(fit_bio_ets, h =12)#plot the forecasted valuesautoplot(forecast_non_bio_ets) +labs(title ="Forecasted Prices of Non-Organic Milk (ETS)")autoplot(forecast_bio_ets) +labs(title ="Forecasted Prices of Organic Milk (ETS)")
The Pareto Principle, often known as the 80/20 rule, asserts that a small proportion of causes, inputs, or efforts usually lead to a majority of the results, outputs, or rewards. Applied to a business context where approximately 20% of the sales account for 80% of the revenues, this principle can help in identifying and focusing on the most profitable aspects of a business.
Evidence from Research:
Sales and Customer Concentration: Research has consistently shown that a significant portion of sales often comes from a minority of customers or products. For instance, an analysis across 22 different consumer packaged goods categories found an average Pareto ratio (PR) of .73, indicating that the top proportion of products/customers often account for a disproportionately high share of sales or profits Source - Kim, Singh, & Winer, 2017
Decision Making and Resource Allocation: The Pareto Principle helps in decision-making by highlighting areas where the greatest impact can be achieved. For example, focusing on the top-performing products or customers can optimize resource allocation and maximize profits Source - Ivančić, 2014
Market and Profit Concentration: Another study noted that a small number of customers are often responsible for a large portion of sales, which supports the strategic focus on these customers to boost profitability and efficiency Source- McCarthy & Winer, 2018
Conclusion: Applying the Pareto Principle in a business context where a minority of sales drives the majority of revenue can lead to more focused and effective business strategies, optimizing efforts towards the most profitable segments. This approach not only simplifies decision-making but also enhances resource allocation, ultimately leading to increased profitability.
7.2.1.1 Steps
Calculating the total sales across all locations for both 2022 and 2023.
Ranking locations by sales to see the cumulative contribution of each location towards the total.
Identifying the point where approximately 20% of the locations contribute to around 80% of the sales.
Click to show code
# Calculate the total sales for each year and the combined total to apply Pareto Principlemerged_sales_data['Combined Sales'] = merged_sales_data['Total Sales 2022'] + merged_sales_data['Total Sales 2023']# Sort locations by combined salespareto_data = merged_sales_data.sort_values(by='Combined Sales', ascending=False)# Calculate cumulative salespareto_data['Cumulative Sales'] = pareto_data['Combined Sales'].cumsum()# Calculate the total of combined salestotal_combined_sales = pareto_data['Combined Sales'].sum()# Calculate the percentage of cumulative salespareto_data['Cumulative Percentage'] =100* pareto_data['Cumulative Sales'] / total_combined_sales# Find the point where about 20% of the locations contribute to approximately 80% of the salespareto_data['Location Count'] =range(1, len(pareto_data) +1)pareto_data['Location Percentage'] =100* pareto_data['Location Count'] /len(pareto_data)# Plotting the Pareto curveplt.figure(figsize=(12, 8))cumulative_line = plt.plot(pareto_data['Location Percentage'], pareto_data['Cumulative Percentage'], label='Cumulative Percentage of Sales', color='b', marker='o')plt.axhline(80.2, color='r', linestyle='dashed', linewidth=1)plt.axvline(33.3, color='green', linestyle='dashed', linewidth=1)plt.title('Pareto Analysis of Sales Across Locations')plt.xlabel('Cumulative Percentage of Locations')plt.ylabel('Cumulative Percentage of Sales')plt.legend()plt.grid(True)plt.show()
Given this graph 33.2% of Manor locations are contributing to 80% of sales. This deviates from the typical Pareto 80/20 distribution, but it still shows a concentration of sales among a subset of stores.
7.2.1.2 Observations
We will identify the top 33.3% of locations based on their cumulative sales contribution. This means selecting the smallest number of locations that together account for at least 80% of the total sales.
The top-performing 33.3% of Manor locations that contribute to the majority of sales are:
Click to show code
# Calculate the threshold for the top 33.3% of locationstop_third_index =int(len(pareto_data) *0.34)# Identifying the top 33.3% of stores contributing to at least 80% of salestop_performing_stores = pareto_data.head(top_third_index)top_performing_stores#> Location Total Sales 2022 ... Location Count Location Percentage#> 14 Monthey 49965 ... 1 3.703704#> 20 Sierre 32212 ... 2 7.407407#> 13 Marin-Epagnier 30228 ... 3 11.111111#> 23 Vevey 25720 ... 4 14.814815#> 10 Geneva 16068 ... 5 18.518519#> 21 Sion 15208 ... 6 22.222222#> 9 Fribourg 14792 ... 7 25.925926#> 11 Lausanne 13186 ... 8 29.629630#> 5 Chavannes-de-Bogis 14359 ... 9 33.333333#> #> [9 rows x 8 columns]
7.2.2 Understanding Success Factors of Top-Performing Stores
7.2.2.1 Correlating Political Parties with Milk Sales
Here, we will then make a scatterplot to identify if there is any correlation between any political party and sales of lait equitable. Our aim is to show that there might be a link with milk sales and a certain political party: are the sales correlated to a certain party presence? ::: {.cell layout-align=“center”}
Click to show code
# Calculate correlation coefficients for each partycorrelation_df <-data.frame(Party =c("PLR", "PS", "UDC", "Centre", "Verts", "Vertliberaux"),Correlation =sapply(merged_data[, 4:9], function(x) cor(x, merged_data$`2023`)))# Print the correlation coefficientsprint(correlation_df)#> Party Correlation#> PLR PLR 0.2796#> PS PS 0.1853#> UDC UDC -0.1933#> Centre Centre 0.0375#> Verts Verts 0.1708#> Vertliberaux Vertliberaux -0.3715# Create a matrix of plots for each partyparty_plots <-lapply(names(merged_data)[4:9], function(party) {ggplot(merged_data, aes_string(x ="`2023`", y = party)) +geom_point() +geom_smooth(method ="lm", se =FALSE, color ="blue") +labs(x ="Annual Sales", y =paste(party, "Party Presence (%)"), title =paste("Correlation:", party, "Party vs. Sales")) +theme_minimal()})#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.#> i Please use tidy evaluation idioms with `aes()`.#> i See also `vignette("ggplot2-in-packages")` for more information.# Arrange the plots in a matrix layoutmatrix_plot <- gridExtra::grid.arrange(grobs = party_plots, ncol =2)matrix_plot#> TableGrob (3 x 2) "arrange": 6 grobs#> z cells name grob#> 1 1 (1-1,1-1) arrange gtable[layout]#> 2 2 (1-1,2-2) arrange gtable[layout]#> 3 3 (2-2,1-1) arrange gtable[layout]#> 4 4 (2-2,2-2) arrange gtable[layout]#> 5 5 (3-3,1-1) arrange gtable[layout]#> 6 6 (3-3,2-2) arrange gtable[layout]
:::
We can now notice that there is no specific correlation between a certain political party and sales of Lait équitable.
We will therefore proceed to select only the Locations where a political party has more than 20% presence, and then sum all the sales per political party, to see which party has the most influence over sales.
Click to show code
# Read sales data from data.qmdsales_data_2023 <- sales_data %>%select(-`2022`)# Filter party data to keep only values above 20filtered_party_data <- party_data %>%filter(PLR >20| PS >20| UDC >20| Centre >20| Verts >20| Vertliberaux >20)# Create separate datasets for each political partyplr_data <- filtered_party_data %>%filter(PLR >20) %>%select(Location, PLR)ps_data <- filtered_party_data %>%filter(PS >20) %>%select(Location, PS)udc_data <- filtered_party_data %>%filter(UDC >20) %>%select(Location, UDC)centre_data <- filtered_party_data %>%filter(Centre >20) %>%select(Location, Centre)verts_data <- filtered_party_data %>%filter(Verts >20) %>%select(Location, Verts)vertliberaux_data <- filtered_party_data %>%filter(Vertliberaux >20) %>%select(Location, Vertliberaux)# Merge each party's data with sales data for 2023plr_sales <-merge(sales_data_2023, plr_data, by.x ="Location")ps_sales <-merge(sales_data_2023, ps_data, by.x ="Location")udc_sales <-merge(sales_data_2023, udc_data, by.x ="Location")centre_sales <-merge(sales_data_2023, centre_data, by.x ="Location")verts_sales <-merge(sales_data_2023, verts_data, by.x ="Location")vertliberaux_sales <-merge(sales_data_2023, vertliberaux_data, by.x ="Location")# Calculate total sales for each partyplr_total_sales <-sum(plr_sales$`2023`)ps_total_sales <-sum(ps_sales$`2023`)udc_total_sales <-sum(udc_sales$`2023`)centre_total_sales <-sum(centre_sales$`2023`)verts_total_sales <-sum(verts_sales$`2023`)# Create a data frame for total sales by partytotal_sales_df <-data.frame(Party =c("PLR", "PS", "UDC", "Centre", "Verts"),Total_Sales =c(plr_total_sales, ps_total_sales, udc_total_sales, centre_total_sales, verts_total_sales))# Define colors for each partyparty_colors <-c("PLR"="blue", "PS"="red", "UDC"="darkgreen", "Centre"="orange", "Verts"="green")# Sort the data frame by Total_Sales in descending ordertotal_sales_df <- total_sales_df[order(-total_sales_df$Total_Sales), ]# Plot total sales by party in descending order with specified colorsggplot(total_sales_df, aes(x =reorder(Party, -Total_Sales), y = Total_Sales, fill = Party)) +geom_bar(stat ="identity") +labs(title ="Total Sales by Political Party in 2023", x ="Party", y ="Total Sales") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) +scale_fill_manual(values = party_colors)
We can see that there actually is a lot of sales of lait équitable where PS is present more than 20%
7.2.2.2 Correlating average revenue with Milk Sales
Now, we want to see if there is some correlation between the income per taxpayer of a commune and its sales. ::: {.cell layout-align=“center”}
Click to show code
# Create a scatterplotggplot(merged_df, aes(x =`Revenu/contribuable`, y =`2022`)) +geom_point(aes(color = Location)) +labs(x ="Revenu/contribuable", y ="Sales 2022", title ="Relationship between Revenu/contribuable and Sales in 2022") +geom_smooth(method ="lm", se =FALSE) +theme_minimal()# Create another scatterplot for 2023ggplot(merged_df, aes(x =`Revenu/contribuable`, y =`2023`)) +geom_point(aes(color = Location)) +labs(x ="Revenu/contribuable", y ="Sales 2023", title ="Relationship between Revenu/contribuable and Sales in 2023") +geom_smooth(method ="lm", se =FALSE) +theme_minimal()
::: Here we can again see that there is no specific correlation, and that the correlation is even negative. We will then move on to our last part where we will try to correlate the sales of lait équitable with the proximity of milk producers.
7.2.2.3 Correlating Distance to Producers
Does being close to a Producer influence Sales ? What we do here is to calculate a distance matrix in order to determine the proximity of each city to the producers. We then analyze the correlation between the total sales and the minimum distance to the producer. This analysis helps us understand if the proximity to the producer has any significant impact on the sales of Lait Equitable products.
The distance Matrix is shown as follow : ::: {.cell layout-align=“center”}
Click to show code
from geopy.distance import geodesiclatitudes = []longitudes = []# Parse each location and extract latitude and longitudefor location in locations: lat, lon = location.split(',') latitudes.append(float(lat)) longitudes.append(float(lon))# Create a DataFrame using pandasproducers = pd.DataFrame({'Latitude': latitudes,'Longitude': longitudes})#get the df from the python codecities = df['City']# Initialize an empty DataFrame to store distancesdistance_matrix = pd.DataFrame(index=cities, columns=[f"Producer {i+1}"for i inrange(len(producers))])# Calculate distances and fill the DataFramefor city in cities: city_lat, city_lon = get_lat_lon(city)if city_lat isnotNoneand city_lon isnotNone: city_coords = (city_lat, city_lon)for index, producer in producers.iterrows(): producer_coords = (producer['Latitude'], producer['Longitude']) distance = geodesic(city_coords, producer_coords).kilometers # distance in kilometers distance_matrix.loc[city, f"Producer {index+1}"] = distance# Flatten the DataFrame to get all distance values in one seriesall_distances = distance_matrix.values.flatten()
:::
Click to show code
#get the distance_matrix in rdistance_matrix <- py$distance_matrix#use reactable to show the table #show it using reactablereactable( distance_matrix, highlight =TRUE, # Highlight rows on hoverdefaultPageSize =10, # Display 10 rows per pagepaginationType ="numbers", # Use numbers for page navigationsearchable =TRUE, # Make the table searchablesortable =TRUE, # Allow sortingresizable =TRUE# Allow column resizing)
We can see here the distribution of distances between cities and producers. The histogram shows the frequency of distances between cities and producers, providing insights into the geographical distribution of producers and their proximity to cities.
Statistic
Distance
Minimum Distance
0.68 km
Maximum Distance
283.46 km
Average Distance
116.73 km
Median Distance
117.62 km
Click to show code
# Basic statisticsprint("Distance Statistics:")#> Distance Statistics:print("Minimum Distance: {:.2f} km".format(all_distances.min()))#> Minimum Distance: 0.68 kmprint("Maximum Distance: {:.2f} km".format(all_distances.max()))#> Maximum Distance: 283.30 kmprint("Average Distance: {:.2f} km".format(all_distances.mean()))#> Average Distance: 116.73 kmprint("Median Distance: {:.2f} km".format(np.median(all_distances)))#> Median Distance: 117.62 km# Histogram of the distancesplt.figure(figsize=(10, 6))plt.hist(all_distances, bins=30, color='#24918d', alpha=0.7)plt.title('Distribution of Distances Between Cities and Producers')plt.xlabel('Distance in km')plt.ylabel('Frequency')plt.grid(True)plt.show()
We see that the correlation is non-existent between the total sales and the distance to the producer. This suggests that the proximity to the producer does not significantly influence the sales of Lait Equitable products in the analyzed locations.
Correlation between Total Sales 2022 and Min Distance to Producer:
Correlation between Total Sales 2023 and Min Distance to Producer:
1.000000
0.014807
0.014807
1.000000
Click to show code
df_sales = r.get('df_merged_sales')#rename 'Location' column as 'City'df_sales.rename(columns={'Location': 'City'}, inplace=True)df_sales.set_index('City', inplace=True)# Calculate the minimum distance for each city and add it to df_salesdf_sales['Min Distance to Producer'] = distance_matrix.min(axis=1)# Calculate Pearson correlationcorrelation_2022 = df_sales[['Total Sales 2022', 'Min Distance to Producer']].corr(method='pearson')correlation_2023 = df_sales[['Total Sales 2023', 'Min Distance to Producer']].corr(method='pearson')print("Correlation between Total Sales 2022 and Min Distance to Producer:")#> Correlation between Total Sales 2022 and Min Distance to Producer:print(correlation_2022)#> Total Sales 2022 Min Distance to Producer#> Total Sales 2022 1.000000 0.014739#> Min Distance to Producer 0.014739 1.000000print("Correlation between Total Sales 2023 and Min Distance to Producer:")#> Correlation between Total Sales 2023 and Min Distance to Producer:print(correlation_2023)#> Total Sales 2023 Min Distance to Producer#> Total Sales 2023 1.000000 0.033715#> Min Distance to Producer 0.033715 1.000000# Convert 'Total Sales 2022' and 'Min Distance to Producer' to numeric types explicitlydf_sales['Total Sales 2022'] = pd.to_numeric(df_sales['Total Sales 2022'], errors='coerce')df_sales['Min Distance to Producer'] = pd.to_numeric(df_sales['Min Distance to Producer'], errors='coerce')# Plotting Total Sales 2022 vs. Min Distance to Producerplt.figure(figsize=(10, 6))sns.regplot( x='Min Distance to Producer', y='Total Sales 2022', data=df_sales, scatter_kws={'s': 50, 'color': '#7e57c2'}, # Customizing the scatter plot points line_kws={'color': '#33848D', 'lw': 2} # Customizing the regression line)plt.title('Total Sales 2022 vs. Minimum Distance to Producer')plt.xlabel('Minimum Distance to Producer (km)')plt.ylabel('Total Sales 2022')plt.grid(True)plt.show()
Click to show code
# Plot for 2023plt.figure(figsize=(10, 6))sns.regplot(x='Min Distance to Producer', y='Total Sales 2023', data=df_sales, scatter_kws={'s': 50, 'color': '#33848D'}, line_kws={'color': '#7e57c2'})plt.title('Total Sales 2023 vs. Min Distance to Producer')plt.xlabel('Minimum Distance to Producer (km)')plt.ylabel('Total Sales 2023')plt.grid(True)plt.show()
7.2.2.4 Diving Deeper
To dive a bit deeper into the insights on how the sales of each manor is influenced, we’ll write a Python script using the pandas library. The goal is to iterate through each city (column in the distance_matrix DataFrame), find the producer (row) with the minimum distance for that city, and then tally the number of times each producer is the closest to any city. Finally, we’ll create a DataFrame to display the number of times each producer was closest to a city.
Here is a step-by-step guide and the corresponding code:
Import the pandas library. Load the data into a DataFrame: We’ll assume the data you provided is in a CSV or Excel file. If it’s in another format, you can adjust the loading method accordingly. Initialize a DataFrame to keep track of the scores for each producer. Iterate through each column (city) in the distance_matrix_t DataFrame, find the index of the minimum distance, and update the score for the respective producer. Display the final DataFrame with the scores.
Click to show code
# Initialize a DataFrame to store scores for each producerscores = pd.DataFrame(0, index=distance_matrix.index, columns=['Score'])# Iterate through each city (column) to find the producer with the minimum distancefor city in distance_matrix.columns: min_distance_producer = distance_matrix[city].idxmin() scores.loc[min_distance_producer, 'Score'] +=1#merges scores with df_salesdf = df_sales.merge(scores, left_index=True, right_index=True, how='left')correlation_matrix = df[['Total Sales 2022', 'Total Sales 2023', 'Score']].corr()# Display the correlation matrixcorrelation_matrix#> Total Sales 2022 Total Sales 2023 Score#> Total Sales 2022 1.000000 0.994222 0.250883#> Total Sales 2023 0.994222 1.000000 0.268046#> Score 0.250883 0.268046 1.000000
Here is the correlation matrix for the total sales in 2022, total sales in 2023, and the score calculated based on the proximity of each producer to the cities
Total Sales 2022
Total Sales 2023
Score
Total Sales 2022
1.000000
0.994222
0.250883
Total Sales 2023
0.994222
1.000000
0.268046
Score
0.250883
0.268046
1.000000
It is better than the minimum distance correlation, but it is still not very high. This suggests that the proximity of a producer to a city, as measured by the score, has a moderate positive correlation with the total sales in both 2022 and 2023.
Click to show code
# Plotting 'Sales 2022' vs 'Score'plt.figure(figsize=(10, 5))plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st subplotsns.regplot(x='Total Sales 2022', y='Score', data=df, ci=None, scatter_kws={'color': '#7e57c2'}, line_kws={'color': '#33848D'})plt.title('Sales 2022 vs. Score')# Plotting 'Sales 2023' vs 'Score'plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd subplotsns.regplot(x='Total Sales 2023', y='Score', data=df, ci=None, scatter_kws={'color': '#33848D'}, line_kws={'color': '#7e57c2'})plt.title('Sales 2023 vs. Score')# Show the plotsplt.tight_layout()plt.show()